Review of Statistical Estimation

STA4173: Biostatistics

January 1, 2025

Introduction

  • In this lecture, we will review summary statistics

    • Continuous variables

      • Mean
      • Median
      • Percentiles / quartiles
      • Variance and standard deviation
      • Interquartile range
    • Categorical variables

      • Count
      • Percentage

Summary Statistics: Introduction

  • In this course, we will review formulas, but we will use R for computational purposes

    • Remember to refer to the lecture notes for specific code needed

    • Code is also available on this course’s GitHub repository

  • We can use base R for some things, but I try to stay in the tidyverse when possible.

  • If we need to install packages, we use the install.packages() function,

install.packages("package name")
  • To call packages in, we use the library() function,
library(tidyverse)

Summary Statistics: Mean

  • Definition: sample arithmetic mean

\bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}

  • R syntax:
# method 1: base R
mean(vector_name) # if your variable is not part of a dataset
mean(dataset_name$variable_name) # if your variable is part of a dataset

# method 2: tidyverse
dataset_name %>% summarize(mean(variable_name))

Summary Statistics: Mean

  • The following data represent the first exam score of 10 randomly selected students in STA2023. Find the average exam score.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
mean(scores)
[1] 79


  • What happens to the mean if there is an extreme observation?
82, 77, 90, 71, 26, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)
mean(scores)
[1] 75.4
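
  • As a quick check on the formula, the sketch below computes the mean directly from its definition and repeats the calculation with the extreme score. The object names (xbar_by_hand, scores_extreme, xbar_extreme) are introduced here only for illustration.

```r
# Exam scores from the example above
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)

# Sample mean from the definition: sum of the observations divided by n
n <- length(scores)
xbar_by_hand <- sum(scores) / n

# Replace the score of 62 with the extreme score of 26 and recompute
scores_extreme <- replace(scores, scores == 62, 26)
xbar_extreme <- sum(scores_extreme) / n

xbar_by_hand   # agrees with mean(scores)
xbar_extreme   # agrees with mean(scores_extreme)
```

  • Changing one observation by 36 points moves the mean by 36/10 = 3.6 points, which is why the mean is described as sensitive to extreme observations.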

Summary Statistics: Variance and Standard Deviation

  • Definition: sample variance

s^2 = \frac{\sum_{i=1}^n x_i^2 - \frac{(\sum_{i=1}^n x_i)^2}{n}}{n-1}

  • Definition: sample standard deviation

s = \sqrt{s^2}

  • R syntax:
# method 1: base R
var(vector_name) # sample variance, if your variable is not part of a dataset
sd(vector_name) # sample standard deviation
var(dataset_name$variable_name) # if your variable is part of a dataset
sd(dataset_name$variable_name)

# method 2: tidyverse
dataset_name %>% summarize(var(variable_name), sd(variable_name))

Summary Statistics: Variance and Standard Deviation

  • The following data represent the first exam score of 10 randomly selected students in STA2023. Find the variance and standard deviation.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
var(scores)
[1] 107.1111
sd(scores)
[1] 10.34945

  • What happens to the standard deviation if there is an extreme observation?
82, 77, 90, 71, 26, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)
sd(scores)
[1] 19.30573
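
  • The variance formula on the earlier slide is the computational shortcut. It is algebraically the same as the definitional form, the sum of squared deviations from the mean divided by n - 1. A minimal sketch comparing the two, with object names chosen here for illustration:

```r
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
n <- length(scores)

# Definitional form: squared deviations from the mean, divided by n - 1
s2_definition <- sum((scores - mean(scores))^2) / (n - 1)

# Computational form from the slide: uses only sum(x^2) and sum(x)
s2_shortcut <- (sum(scores^2) - sum(scores)^2 / n) / (n - 1)

# The standard deviation is the square root of the variance
s_by_hand <- sqrt(s2_shortcut)

c(s2_definition, s2_shortcut, var(scores))  # all three agree
c(s_by_hand, sd(scores))                    # both agree
```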

Summary Statistics: Median

  • Definition: median

    • The value that lies in the middle of the data when arranged in ascending order.

      • If n is odd, then the median is literally the middle number.

      • If n is even, then the median is the average of the two middle numbers.

  • R syntax:

# method 1: base R
median(vector_name) # if your variable is not part of a dataset
median(dataset_name$variable_name) # if your variable is part of a dataset

# method 2: tidyverse
dataset_name %>% summarize(median(variable_name))

Summary Statistics: Median

  • The following data represent the first exam score of 10 randomly selected students in STA2023. Find the median exam score.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
median(scores)
[1] 79.5


  • What happens to the median if there is an extreme observation?
82, 77, 90, 71, 26, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)
median(scores)
[1] 79.5
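
  • To see why the median did not move, it can help to apply the definition by hand: sort the data and take the middle value (or average the two middle values). The sketch below does this for the original scores; median_by_hand is a name introduced here for illustration.

```r
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)

# Arrange the data in ascending order
sorted <- sort(scores)
n <- length(sorted)

if (n %% 2 == 1) {
  # n odd: the single middle value
  median_by_hand <- sorted[(n + 1) / 2]
} else {
  # n even: the average of the two middle values
  median_by_hand <- mean(sorted[c(n / 2, n / 2 + 1)])
}

median_by_hand  # agrees with median(scores)
```

  • Here n = 10, so the median is the average of the 5th and 6th sorted values (77 and 82). Replacing 62 with 26 changes only the smallest value, so the two middle values, and therefore the median, stay the same.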

Summary Statistics: Percentiles

  • Definition: kth percentile, Pk

    • k% of the observations in the dataset are less than or equal to that value.
  • R syntax:
quantile(vector_name, percentile)  # if your variable is not part of a dataset
quantile(dataset_name$variable_name, percentile) # if your variable is part of a dataset

Summary Statistics: Percentiles

  • The following data represent the first exam score of 10 randomly selected students in STA2023.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
  • What is the 45th percentile?
quantile(scores, 0.45)
  45% 
77.25 
  • What is the 90th percentile?
quantile(scores, 0.90)
 90% 
90.4 
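
  • The answers above are not values in the data because quantile() interpolates between neighboring sorted values. By default it uses its "type 7" rule; other software may use a different rule and give slightly different answers. A sketch of that rule, where percentile_by_hand() is a hypothetical helper written for this illustration and not a built-in function:

```r
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)

# Hypothetical helper: linear interpolation between sorted values,
# following the default (type 7) rule used by quantile()
percentile_by_hand <- function(x, p) {
  sorted <- sort(x)
  n <- length(sorted)
  h <- (n - 1) * p + 1         # fractional position in the sorted data
  lower <- floor(h)            # position of the sorted value at or below h
  upper <- min(lower + 1, n)   # next position, capped at the maximum
  sorted[lower] + (h - lower) * (sorted[upper] - sorted[lower])
}

percentile_by_hand(scores, 0.45)  # agrees with quantile(scores, 0.45)
percentile_by_hand(scores, 0.90)  # agrees with quantile(scores, 0.90)
```

  • For the 45th percentile, the position is (10 - 1)(0.45) + 1 = 5.05, so the answer sits 5% of the way from the 5th sorted value (77) to the 6th (82): 77 + 0.05(5) = 77.25.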

Summary Statistics: Quartiles

  • Definition: quartiles

    • Values that divide the dataset into fourths, or four equal parts: P25, P50 (median), P75
  • Definition: five number summary

    • Minimum, P25, P50 (median), P75, maximum
  • R Syntax:
quantile(vector_name, c(0.00, 0.25, 0.50, 0.75, 1.00))  # if your variable is not part of a dataset
quantile(dataset_name$variable_name, c(0.00, 0.25, 0.50, 0.75, 1.00)) # if your variable is part of a dataset

Summary Statistics: Quartiles

  • The following data represent the first exam score of 10 randomly selected students in STA2023. Find the five number summary.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
quantile(scores, c(0.00, 0.25, 0.50, 0.75, 1))
   0%   25%   50%   75%  100% 
62.00 71.75 79.50 87.00 94.00 

  • What happens to the five number summary if there is an extreme observation?
82, 77, 90, 71, 26, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)
quantile(scores, c(0.00, 0.25, 0.50, 0.75, 1))
   0%   25%   50%   75%  100% 
26.00 71.75 79.50 87.00 94.00 

Summary Statistics: Interquartile Range

  • Definition: interquartile range

    • A measure of the spread of the middle half of the data

\text{IQR} = P_{75} - P_{25}

  • R syntax:
# method 1: base R
IQR(vector_name) # if your variable is not part of a dataset
IQR(dataset_name$variable_name) # if your variable is part of a dataset

# method 2: tidyverse
dataset_name %>% summarize(IQR(variable_name))

Summary Statistics: Interquartile Range

  • The following data represent the first exam score of 10 randomly selected students in STA2023. Find the IQR.
82, 77, 90, 71, 62, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
IQR(scores)
[1] 15.25

  • What happens to the IQR if there is an extreme observation?
82, 77, 90, 71, 26, 68, 74, 84, 94, 88
scores <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)
IQR(scores)
[1] 15.25
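
  • The IQR can also be recovered from the quartiles as P75 minus P25. The sketch below does that and then lines up the four summaries from this lecture with and without the extreme score. The object names (iqr_by_hand, scores_extreme, comparison) are introduced here for illustration.

```r
scores <- c(82, 77, 90, 71, 62, 68, 74, 84, 94, 88)
scores_extreme <- c(82, 77, 90, 71, 26, 68, 74, 84, 94, 88)

# IQR from its definition: 75th percentile minus 25th percentile
q <- quantile(scores, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand  # agrees with IQR(scores)

# Side-by-side comparison of each summary on the two data sets
comparison <- data.frame(
  statistic    = c("mean", "sd", "median", "IQR"),
  original     = c(mean(scores), sd(scores), median(scores), IQR(scores)),
  with_extreme = c(mean(scores_extreme), sd(scores_extreme),
                   median(scores_extreme), IQR(scores_extreme))
)
comparison
```

  • The mean and standard deviation change with the extreme score, while the median and IQR do not.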

Summary Statistics: Frequency Tables

  • When we are dealing with categorical data, we summarize using frequency tables.

    • We are interested in the count and percentage
  • e.g., from the UWF Fact Book, in Fall 2021, there were

    • 1273 (14.4%) freshmen
    • 1349 (15.2%) sophomores
    • 2431 (27.4%) juniors
    • 3807 (43.0%) seniors
  • R syntax:

dataset_name %>%
  group_by(var1_name, var2_name) %>% # do not change anything under this
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
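
  • The percentages in the UWF example are each count divided by the total. A small base R sketch of that arithmetic, using the counts listed above (the object names are chosen here for illustration):

```r
# Counts by class standing from the UWF example
counts <- c(freshmen = 1273, sophomores = 1349, juniors = 2431, seniors = 3807)

# Percentage = 100 * count / total, rounded to one decimal place
total <- sum(counts)
percentages <- round(100 * counts / total, 1)

total
percentages
```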

Summary Statistics: Frequency Tables

  • Consider the Motor Trend Car Road Tests data (mtcars), built into R.

  • The data was extracted from the 1974 Motor Trend magazine, and includes aspects of car design and performance for 32 cars (1973-74 models).

data("mtcars")
head(mtcars, n=5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Summary Statistics: Frequency Table

  • Let’s find the frequency tables for the number of forward gears (gear) and the type of transmission (am; automatic vs. manual),
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
# A tibble: 4 × 4
# Groups:   am [2]
     am  gear     n  freq
  <dbl> <dbl> <int> <dbl>
1     0     3    15 0.789
2     0     4     4 0.211
3     1     4     8 0.615
4     1     5     5 0.385
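  • Note that these frequencies are computed within each transmission type (am), not overall: summarise() drops only the last grouping variable (gear), so the result is still grouped by am and the frequencies sum to 1 within each value of am (e.g., 15/19 = 0.789).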
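
  • In the output above, the frequencies sum to 1 within each value of am, because the summarized data are still grouped by am when mutate() runs. If overall frequencies (out of all 32 cars) are wanted instead, one option is to drop the grouping before dividing. A sketch assuming dplyr 1.0.0 or later, where summarise() accepts the .groups argument; the object names within_am and overall are introduced here for illustration.

```r
library(dplyr)

# Frequencies within each transmission type: keep the grouping by am
within_am <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop_last") %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

# Overall frequencies: drop all grouping, so sum(n) is the total of 32 cars
overall <- mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(freq = n / sum(n))

within_am
overall
```

  • Which version is appropriate depends on the question: "among automatic cars, what share have 3 gears?" calls for the within-group version, while "what share of all cars are automatic with 3 gears?" calls for the overall version.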

Data Visualization: Introduction

  • When presenting results to others, sometimes it is helpful to create a visualization.

  • Continuous data:

    • Histogram (one variable)
    • Scatterplot (two variables)
  • Categorical data:

    • Bar charts
  • Related to analyses:

    • Confidence intervals
    • Regression lines
  • We can also use color to incorporate other variables

    • e.g., scatterplot with weight on y-axis, age on x-axis, markers colored by diabetic status

Data Visualization: Introduction

  • We will use the ggplot2 package for most of our graphing needs.

    • This package is loaded in with the tidyverse package.
  • A good reference book is the official ggplot2: elegant graphics for data analysis text.

    • You can buy a hard copy; however, it is also available online for free.
  • I will often google keywords + ggplot2 and look for examples that provide code.

    • e.g., “histogram ggplot2” led me to this website

    • e.g., “change color of dot ggplot2” led me to this website

    • Sometimes I have to look at several links before I find what I am looking for.

Data Visualization: Introduction

  • We start with the ggplot() function to specify our underlying canvas.
ggplot()

Data Visualization: Introduction

  • We will use the tidyverse pipe operator (%>%) to pipe data into the ggplot() function.
dataset %>% ggplot()
  • Then, we will specify the aesthetics using aes() inside of ggplot().
mtcars %>% ggplot(aes(x = hp, y = mpg))

Data Visualization: Introduction

  • We will add elements to our graph using geom_ functions.

    • geom_line() creates a line
    • geom_point() creates a scatterplot
    • geom_bar() creates a bar chart
    • geom_text() puts text on the graph
    • You can find a list of geom_ functions on the tidyverse website
  • The order that you add them matters!

    • geom_line() + geom_point() = points on top of line
    • geom_point() + geom_line() = line on top of points
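
  • A minimal sketch of the layer-order point, using mtcars. The plot object names are introduced here for illustration, the data are passed directly to ggplot() rather than piped (the two are equivalent), and the light blue point color is only there to make the overlap visible.

```r
library(ggplot2)

# Line added first, points added second: points are drawn on top of the line
points_on_top <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_line() +
  geom_point(size = 5, color = "#8DC8E8")

# Points added first, line added second: the line is drawn on top of the points
line_on_top <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(size = 5, color = "#8DC8E8") +
  geom_line()

points_on_top
line_on_top
```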

Data Visualization: Introduction

  • We can also customize every aspect of our graphs.

    • e.g., the default background is gray, but I personally do not like it, so I typically use theme_minimal() or theme_bw() to give a white background

    • e.g., we can increase the font size to make things readable

    • e.g., we can specify colors for: markers (dots/points), outline of a bar chart or histogram, filling of a bar chart or histogram, lines, text, etc.

  • There are additional functions within other (non-tidyverse) packages that will help us with customization.

Data Visualization: Introduction

  • I do not expect you to become an expert in data visualization

  • As with other R code, I will provide basic code during lecture

  • I do encourage curiosity and exploring further

  • R is a very, very powerful tool for graphing!

    • Even before I was An Official R Programmer©, I used ggplot2 to construct graphs.

    • Other programs are just not great. :(

  • Today we will look at graphs that go along with summary statistics, but we will learn other ways to graph data as we progress through the semester.

Data Visualization: Histogram

ex1 %>% 
  ggplot(aes(x=value)) + 
  geom_histogram(bins=50) +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw() 
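
  • The ex1 data set is not defined in these slides. To try the histogram code, one option is to simulate a stand-in. The sketch below is an assumption made purely for illustration (1000 draws from a standard normal distribution); it is not the data behind the lecture figure.

```r
library(ggplot2)

set.seed(4173)  # arbitrary seed so the simulated values are reproducible

# Hypothetical stand-in for ex1: one numeric column named value
ex1 <- data.frame(value = rnorm(1000))

# Same histogram code as above, with the data passed directly to ggplot()
p <- ggplot(ex1, aes(x = value)) +
  geom_histogram(bins = 50) +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw()
p
```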

Data Visualization: Histogram

ex2 %>% 
  ggplot(aes(x=value)) + 
  geom_histogram(bins=50) +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw() 

Data Visualization: Histogram

ex3 %>% 
  ggplot(aes(x=value)) + 
  geom_histogram(bins=50) +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw() 

Data Visualization: Histogram

ex4 %>% 
  ggplot(aes(x=value)) + 
  geom_histogram(binwidth = 0.1) +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw() 

Data Visualization: Histogram

ex4 %>% 
  ggplot(aes(x=value)) + 
  geom_histogram(binwidth = 0.1, color = "#003865", fill = "#8DC8E8") +
  labs(x = "value of variable",
       y = "number of observations") +
  theme_bw() 

Data Visualization: Scatterplot

mtcars %>% 
  ggplot(aes(y = mpg, x = hp)) + 
  geom_point() + 
  labs(x = "Horsepower",
       y = "Gas Mileage") +
  theme_bw() 

Data Visualization: Scatterplot

mtcars %>% 
  ggplot(aes(y = mpg, x = hp)) + 
  geom_point(size = 5) + 
  labs(x = "Horsepower",
       y = "Gas Mileage") +
  theme_bw() 

Data Visualization: Scatterplot

mtcars %>% 
  ggplot(aes(y = mpg, x = hp, color = am)) + 
  geom_point(size = 5) + 
  labs(x = "Horsepower",
       y = "Gas Mileage") +
  theme_bw() 

Data Visualization: Scatterplot

mtcars %>% 
  ggplot(aes(y = mpg, x = hp, color = as.factor(am))) + 
  geom_point(size = 5) + 
  labs(x = "Horsepower",
       y = "Gas Mileage") +
  theme_bw() 

Data Visualization: Scatterplot

mtcars %>% 
  ggplot(aes(y = mpg, x = hp, color = as.factor(am))) + 
  geom_point(size = 5) + 
  labs(x = "Horsepower",
       y = "Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw() 
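
  • Why as.factor() and why that label order? In mtcars, am is stored as a number (0 = automatic, 1 = manual), so ggplot2 treats it as continuous unless it is converted. The labels in scale_color_manual() are applied to the factor levels in order. A short base R check, plus one possible alternative of labeling the factor up front (the names am_factor and transmission are introduced here for illustration):

```r
# am is numeric in the raw data
class(mtcars$am)

# as.factor() creates levels in sorted order: "0" first, then "1"
am_factor <- as.factor(mtcars$am)
levels(am_factor)

# Alternative: attach the labels when creating the factor,
# so every table and plot shows the names instead of 0 and 1
transmission <- factor(mtcars$am,
                       levels = c(0, 1),
                       labels = c("Automatic", "Manual"))
table(transmission)
```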

Data Visualization: Plot of Means

means <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean = mean(mpg)) %>%
  ungroup()

means %>% 
  ggplot(aes(y = mean, x = cyl)) + 
  geom_point(size = 5) + 
  labs(x = "Number of Cylinders",
       y = "Average Gas Mileage") +
  theme_bw() 

Data Visualization: Plot of Means

means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg)) %>%
  ungroup()

means %>% 
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) + 
  geom_point(size = 5) + 
  labs(x = "Number of Cylinders",
       y = "Average Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw() 

Data Visualization: Plot of Means

means <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg),
            sd = sd(mpg)) %>%
  ungroup()

means %>% 
  ggplot(aes(y = mean, x = cyl, color = as.factor(am))) + 
  geom_point(size = 5) + 
  geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 0.15) +
  labs(x = "Number of Cylinders",
       y = "Average Gas Mileage",
       color = "Transmission") +
  scale_color_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw() 

Data Visualization: Bar Chart

mtcars %>% ggplot(aes(x=as.factor(cyl))) +
  geom_bar() +
  labs(x = "Number of Cylinders",
       y = "Number of Cars") +
  theme_bw()

Data Visualization: Bar Chart

mtcars %>% ggplot(aes(x=as.factor(cyl), fill=as.factor(am))) +
  geom_bar(position = 'dodge') +
  labs(x = "Number of Cylinders",
       y = "Number of Cars",
       fill = "Transmission") +
  scale_fill_manual(labels = c("Automatic", "Manual"),
                     values = c("#003865", "#8DC8E8")) +
  theme_bw()

Wrap Up

  • In this lecture, we have reviewed how to describe data.

    • Summary statistics
    • Basic data visualization
  • There is not a one-size-fits-all graph!

    • Always keep in mind the story we are trying to tell and what will aid our explanation.
  • Next, we will review statistical inference.

    • Confidence intervals
    • Hypothesis testing